# library(ggplot2)
library(RColorBrewer)
library(ISLR)
library(latex2exp)
# theme_set(theme_bw(base_size = 14))
cols <- brewer.pal(9, "Set1")

Why are we interested in selecting only some variables in the first place?

  1. To improve prediction accuracy:

    • If \(n \gg p\), the least squares estimates tend to have low variance, and hence will perform well on test observations

    • If \(n\) is not much larger than \(p\), then there can be a lot of variability in the least squares fit, resulting in overfitting and consequently poor predictions on future observations not used in model training.

    • If \(n < p\), then there is no longer a unique least squares coefficient estimate: the method cannot be used at all.

  2. To improve model interpretability: often some or many of the variables used in a multiple regression model are not associated with the response. Including such irrelevant variables leads to unnecessary complexity in the resulting model. By removing these variables, i.e. by setting the corresponding coefficient estimates to zero, we can obtain a model that is more easily interpreted.

There are two possible solutions: subset selection, or shrinkage of the estimated coefficients. Using these methods, we can often substantially reduce the variance at the cost of a negligible increase in bias. This can lead to substantial improvements in the accuracy with which we can predict the response for observations not used in model training.


Covariate selection

Best Subset selection

Let us try it out on the Hitters data: we try to explain the salaries of baseball players.

There are some missing values here, so before we proceed we will remove them:

Hitters <- na.omit(Hitters)

We will now use the package leaps to evaluate all the best-subset models.

To perform best subset selection, we fit a separate least squares regression for each possible combination of the \(p\) predictors. That is, we fit all \(p\) models that contain exactly one predictor, all \({p \choose 2} = p(p−1)/2\) models that contain exactly two predictors, and so forth. We then look at all of the resulting models, with the goal of identifying the one that is best.

For fixed model dimension, we can choose the best model via RSS (which is a valid criterion for comparison). When comparing models of different sizes, we must use a penalized criterion (AIC, BIC, \(C_{p}\)).
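
For reference, and up to additive constants that do not affect the ranking of the models, for a model with \(d\) estimated coefficients (intercept included) and residual sum of squares RSS these criteria take the form

\[C_p = \frac{\text{RSS}}{\hat{\sigma}^2} - n + 2d, \qquad \text{BIC} = n \log\!\left(\frac{\text{RSS}}{n}\right) + d \log n,\]

where \(\hat{\sigma}^2\) is the error variance estimated from the full model. Both criteria trade off goodness of fit (RSS) against model size \(d\); the BIC penalty grows with \(n\), so it tends to favour smaller models.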

library(leaps)
regfit_full <- regsubsets(Salary ~ ., data = Hitters, method = 'exhaustive')
plot(regfit_full, scale = "Cp")

By default, regsubsets() reports the best subsets up to size \(8\); let’s increase that to \(p = 19\), i.e. all the variables, using the argument nvmax = p.

p <- ncol(Hitters) - 1
regfit_full <- regsubsets(Salary ~ ., data = Hitters, nvmax = p)
reg_summary <- summary(regfit_full)

plot(reg_summary$cp, xlab = "Model size", ylab = bquote(C[p]))
which.min(reg_summary$cp)
[1] 10
points(which.min(reg_summary$cp), reg_summary$cp[which.min(reg_summary$cp)], 
       pch = 20, col = cols[1])

There is also a plot method for the regsubsets object. This shows graphically which predictors enter each model, ranked according to the criterion of choice (in this case, \(C_{p}\)), making it easier to see which model is the “best” one.

plot(regfit_full, scale = "Cp")

coef(regfit_full, which.min(reg_summary$cp))
 (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns 
 162.5354420   -2.1686501    6.9180175    5.7732246   -0.1300798    1.4082490 
        CRBI       CWalks    DivisionW      PutOuts      Assists 
   0.7743122   -0.8308264 -112.3800575    0.2973726    0.2831680 

Stepwise Selection

For computational reasons, best subset selection cannot be applied with very large \(p\): it requires fitting all \(2^{p}\) possible models (more than 500,000 when \(p = 19\)). Stepwise methods explore a far more restricted set of models. In leaps, we can also specify which method we want to use. For example, to perform backward selection:

fit_backward <- regsubsets(Salary ~ ., data = Hitters, nvmax = p, method = "backward")
backward <- cbind(summary(fit_backward)$which,
                  Cp = summary(fit_backward)$cp,
                  BIC = summary(fit_backward)$bic)
backward
   (Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1      1    0     0    0   0     0     0      0     0      0     0    1      0
2      1    0     1    0   0     0     0      0     0      0     0    1      0
3      1    0     1    0   0     0     0      0     0      0     0    1      0
4      1    1     1    0   0     0     0      0     0      0     0    1      0
5      1    1     1    0   0     0     1      0     0      0     0    1      0
6      1    1     1    0   0     0     1      0     0      0     0    1      0
7      1    1     1    0   0     0     1      0     0      0     0    1      0
8      1    1     1    0   0     0     1      0     0      0     0    1      1
9      1    1     1    0   0     0     1      0     1      0     0    1      1
10     1    1     1    0   0     0     1      0     1      0     0    1      1
11     1    1     1    0   0     0     1      0     1      0     0    1      1
12     1    1     1    0   1     0     1      0     1      0     0    1      1
13     1    1     1    0   1     0     1      0     1      0     0    1      1
14     1    1     1    1   1     0     1      0     1      0     0    1      1
15     1    1     1    1   1     0     1      0     1      1     0    1      1
16     1    1     1    1   1     1     1      0     1      1     0    1      1
17     1    1     1    1   1     1     1      0     1      1     0    1      1
18     1    1     1    1   1     1     1      1     1      1     0    1      1
19     1    1     1    1   1     1     1      1     1      1     1    1      1
   CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN         Cp
1       0        0       0       0      0      0         0 106.874632
2       0        0       0       0      0      0         0  56.314938
3       0        0       0       1      0      0         0  40.279613
4       0        0       0       1      0      0         0  32.675465
5       0        0       0       1      0      0         0  25.224013
6       0        0       1       1      0      0         0  18.835412
7       1        0       1       1      0      0         0  13.398979
8       1        0       1       1      0      0         0   7.624674
9       1        0       1       1      0      0         0   6.158685
10      1        0       1       1      1      0         0   5.009317
11      1        1       1       1      1      0         0   5.874113
12      1        1       1       1      1      0         0   7.330766
13      1        1       1       1      1      1         0   8.888112
14      1        1       1       1      1      1         0  10.481576
15      1        1       1       1      1      1         0  12.346193
16      1        1       1       1      1      1         0  14.187546
17      1        1       1       1      1      1         1  16.087831
18      1        1       1       1      1      1         1  18.011425
19      1        1       1       1      1      1         1  20.000000
          BIC
1   -88.97559
2  -124.18997
3  -134.21006
4  -137.33435
5  -140.61064
6  -143.14927
7  -144.98259
8  -147.38199
9  -145.44316
10 -143.21651
11 -138.86077
12 -133.87283
13 -128.77759
14 -123.64420
15 -118.21832
16 -112.81768
17 -107.35339
18 -101.86391
19  -96.30412
plot(fit_backward, scale = "Cp")

Remember that backward stepwise selection begins with the full model containing all \(p\) predictors, and then removes predictors from the model, one at a time, until no predictors remain. At each step, the predictor whose removal leads to the smallest increase in RSS is removed.

This table is equivalent to the graphical representation just seen.
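
As a quick sanity check (this is not part of the leaps workflow), the first backward step can be reproduced with lm() and drop1(), which refits the full model with each term removed in turn:

full_lm <- lm(Salary ~ ., data = Hitters)
# The term whose removal increases the RSS the least is the first one dropped
drop1(full_lm)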

We can also do forward selection:

fit_forward <- regsubsets(Salary ~ ., data = Hitters, nvmax = p, method = "forward")
plot(fit_forward, scale = "Cp")

Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.
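
For comparison (again, not part of the leaps output), a forward path can also be built with step() from base R. Note that step() adds variables by AIC rather than by RSS, so the sequence may differ slightly from the one found by regsubsets:

null_lm <- lm(Salary ~ 1, data = Hitters)
full_lm <- lm(Salary ~ ., data = Hitters)
# Forward search from the empty model up to the full model, guided by AIC
fwd_step <- step(null_lm, scope = formula(full_lm), direction = "forward", trace = 0)
# Order in which the variables entered the model
fwd_step$anova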

Model Selection Using Cross-Validation

We can also directly estimate the test error using cross-validation. We compute the cross-validation error for the best model of each size, and then select the model for which the resulting estimated test error is smallest. This procedure has an advantage relative to AIC, BIC, \(C_{p}\), and adjusted \(R^{2}\), in that it provides a direct estimate of the test error, and makes fewer assumptions about the true underlying model.

For every fold \(k = 1, \dots, K\):

  1. Using only data not in fold \(k\) (training set), find the optimal models for every size \(j = 1, \dots, p\)
  2. For every one of these models of different sizes:
    1. make predictions on the data in fold \(k\) (test set)
    2. compute the test MSE

We have to do a bit of work here, because there is no predict method for regsubsets. This is a little tedious, so we will write one ourselves!

predict_regsubsets <- function(object, newdata, id, ...){
  # Recover the formula used in the regsubsets() call
  form <- as.formula(object$call[[2]])
  # Build the design matrix for the new data
  mat <- model.matrix(form, newdata)
  # Coefficients of the best model with `id` predictors
  coef_i <- coef(object, id = id)
  # Keep only the columns appearing in that model and compute the predictions
  return(mat[, names(coef_i)] %*% coef_i)
}
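
As a quick check (not in the original lab), we can compare the fitted values of the size-10 model with the observed salaries:

# Predictions of the 10-predictor model for the first few players
pred10 <- predict_regsubsets(regfit_full, Hitters, id = 10)
head(cbind(predicted = as.vector(pred10), observed = Hitters$Salary))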

Remember cross-validation from last time? Let’s try it out here!

n <- nrow(Hitters)
K <- 20
folds <- cut(sample(1:n), breaks = K, labels = FALSE)
table(folds)
folds
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
14 13 13 13 13 13 13 13 13 14 13 13 13 13 13 13 13 13 13 14 
cv_err <- array(NA, dim = c(K, p))
for(k in 1:K){
  idx_tr <- which(folds != k)
  idx_ts <- which(folds == k)
    
  # find the best model of each size (forward stepwise, for speed)
  best_fit <- regsubsets(Salary ~ ., data = Hitters[idx_tr,], nvmax = p, method = "forward")
  for(i in 1:p){
    pred <- predict_regsubsets(best_fit, Hitters[idx_ts,], id = i)
    cv_err[k,i] <- mean((Hitters$Salary[idx_ts] - pred)^2)
  }
}
cv_MSE <- colMeans(cv_err)
se_cv_MSE <- apply(cv_err, 2, sd)/sqrt(K)

library(plotrix)
plotCI(x = 1:p, y = cv_MSE, uiw = se_cv_MSE, col = 'red', scol = 'gray', 
       pch = 16, lwd = 2, xlab = 'Model size', ylab = 'CV Error')
# Vertical line at the model size minimizing the CV error
abline(v = which.min(cv_MSE), lwd = 1, col = 1, lty = 2)
# Horizontal line at the minimum CV error plus one standard error
abline(h = min(cv_MSE) + se_cv_MSE[which.min(cv_MSE)], lty = 2)
# Model sizes whose CV error is still below this threshold
idx <- which(cv_MSE <= min(cv_MSE) + se_cv_MSE[which.min(cv_MSE)])
# Pick the smallest of these model sizes
idx <- idx[1]
# Vertical line at the most parsimonious model whose error is within one 
# standard error of the optimal error
abline(v = idx, lwd = 1, col = 1, lty = 3)

Shrinkage Methods

Ridge regression

Consider the standard linear regression model \[y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon.\]

The ridge regression involves estimating \(\mathbf{\beta}\) as the solution to the penalized least-squares problem \[\hat{\mathbf{\beta}} = \text{argmin}_{\mathbf{\beta}} \sum_{i=1}^n \left(y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip}\right)^2 + \lambda \|\mathbf{\beta}\|_{2}^{2},\] where by convention the intercept \(\beta_0\) is not penalized.

This penalty has the effect of shrinking the estimates of the \(\beta_{j}\)’s toward 0. The bigger \(\lambda\), the more aggressive the shrinkage effect.
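
For intuition (this is not needed for the glmnet fit below), if we ignore the intercept and assume centered predictors, the ridge criterion has the closed-form minimizer

\[\hat{\mathbf{\beta}}^{(\lambda)} = \left(\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I}\right)^{-1}\mathbf{X}^{\top}\mathbf{y},\]

so the penalty simply adds \(\lambda\) to the diagonal of \(\mathbf{X}^{\top}\mathbf{X}\) before inversion: the larger \(\lambda\), the more the estimates are pulled toward zero, and the problem remains well posed even when \(p > n\).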

We will use the package glmnet, which does not use the model formula language, so we will set up X and y.

rm(list=ls())
library(glmnet)
cols <- brewer.pal(9, "Set1")
data(Hitters)
Hitters <- na.omit(Hitters)

X <- model.matrix(Salary ~ . -1, data = Hitters)
y <- Hitters$Salary

p <- ncol(X)

First we fit a ridge-regression model across a range of \(\lambda\) values and plot the solution path \(\hat{\mathbf{\beta}}^{(\lambda)}\) as a function of \(\lambda\). Later on, we will also track the in-sample mean-squared prediction error of the fit along the solution path:

\[\text{MSE}(\hat{\mathbf{\beta}}^{(\lambda)}) = \dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\beta}_0^{(\lambda)} - \hat{\beta}_1^{(\lambda)} x_{i1} - \dots - \hat{\beta}_p^{(\lambda)} x_{ip})^{2}\]

This is achieved by calling glmnet with alpha = 0. Note that by default, the glmnet() function standardizes the variables so that they are on the same scale. To turn off this default setting, use the argument standardize=FALSE.

fit_ridge <- glmnet(X, y, alpha = 0)
plot(fit_ridge, xvar = "lambda", label = TRUE, xlab = bquote('log('*lambda*')'), ylab = bquote(beta[j]), main = 'Ridge coefficients')

As one can see, as the value of \(\lambda\) increases, the coefficient estimates are shrunk closer and closer to \(0\); unlike the lasso below, however, ridge regression does not set any of them exactly to \(0\).
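
To see the shrinkage numerically, we can extract the coefficients at two values of \(\lambda\) (the values 10 and 10000 are arbitrary, chosen only for illustration):

# Coefficients at a small and at a large penalty (s is the value of lambda)
cf <- cbind(as.matrix(coef(fit_ridge, s = 10)),
            as.matrix(coef(fit_ridge, s = 10000)))
colnames(cf) <- c("lambda = 10", "lambda = 10000")
round(cf, 3)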

Lasso regression

The lasso involves estimating \(\mathbf{\beta}\) as the solution to the penalized least-squares problem \[\hat{\mathbf{\beta}} = \text{argmin}_{\mathbf{\beta}} \sum_{i=1}^n \left(y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip}\right)^2 + \lambda \|\mathbf{\beta}\|_{1}\]

This penalty function will have the effect of both selecting a set of nonzero \(\beta_{i}\)’s (i.e. sparsifying the estimate) as well as shrinking the nonzero \(\beta_{i}\)’s toward 0. The bigger \(\lambda\), the more aggressive the shrinkage effect.
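
To see why the lasso produces exact zeros while ridge does not, it helps to look at the special (idealized) case of an orthonormal design matrix with no intercept, where both problems have explicit solutions in terms of the least squares estimates \(\hat{\beta}_j\):

\[\hat{\beta}_j^{\text{ridge}} = \frac{\hat{\beta}_j}{1 + \lambda}, \qquad \hat{\beta}_j^{\text{lasso}} = \text{sign}(\hat{\beta}_j)\left(|\hat{\beta}_j| - \lambda/2\right)_{+}.\]

Ridge rescales every coefficient by the same factor, whereas the lasso translates each coefficient toward zero and truncates it at zero (soft thresholding), which is what sets some coefficients exactly to zero.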

Now we fit a lasso model; for this we use the default alpha = 1. We fit the lasso model across a range of \(\lambda\) values and we plot the solution path \(\hat{\mathbf{\beta}}^{(\lambda)}\) as a function of \(\lambda\). As one can see, as the value of \(\lambda\) increases, more and more parameters get shrunk to \(0\).

fit_lasso <- glmnet(X, y)
plot(fit_lasso, xvar = "lambda", label = TRUE, xlab = bquote('log('*lambda*')'), ylab = bquote(beta[j]), 
     main = 'Lasso coefficients')

Selecting the tuning hyperparameter

A natural way to choose \(\lambda\) is to minimize the expected out-of-sample prediction error. Suppose that \((\textbf{x}_{\star}, y_{\star})\) is a future data point from the same population/data-generating process as the original data. The goal would be to make the expected error

\[\text{MOOSE}(\mathbf{\beta}^{(\lambda)}) = \mathbb{E} \left[ (y_{\star} - \hat{\beta}^{(\lambda)}_0 - \hat{\beta}^{(\lambda)}_1 x_{\star 1} - \dots - \hat{\beta}^{(\lambda)}_p x_{\star p})^{2} \right]\]

as small as possible. Of course, we do not have any “future data” lying around, so we have to estimate this quantity using the data we have. The in-sample mean-squared error, \(\text{MSE}(\hat{\mathbf{\beta}}^{(\lambda)})\) is generally an optimistic estimate of this quantity: out-of-sample error tends to be worse, on average, than in-sample error, and we need some way of quantifying how much worse. In the following we estimate the MOOSE using cross validation.

Suppose we want to use cross-validation to select \(\lambda\) for the lasso and for ridge regression. With cv.glmnet, this is easy to do:

cv_lasso <- cv.glmnet(X, y)
plot(cv_lasso, xlab = TeX("$\\log (\\lambda)$"), ylab = 'CV Error')

cv_ridge <- cv.glmnet(X, y, alpha = 0)
plot(cv_ridge, xlab = TeX("$\\log (\\lambda)$"), ylab = 'CV Error')
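
The object returned by cv.glmnet stores both the value of \(\lambda\) minimizing the CV error and the largest \(\lambda\) whose error is within one standard error of that minimum (we will come back to this criterion below):

c(lambda_min = cv_lasso$lambda.min, lambda_1se = cv_lasso$lambda.1se)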

However, let us try to implement our own cross validation function (this is going to be useful in more general cases).

cv_penalized_reg <- function(X, y, lambda_grid, K = floor(sqrt(nrow(X)))){
  # Performs K-fold cross validation for selecting lambda in the lasso 
  # regression (glmnet with the default alpha = 1)
  # ----------------
  # Args:
  #   - X: data matrix
  #   - y: response vector
  #   - lambda_grid: vector of the penalty parameters to consider
  #   - K: number of CV folds
  # Returns:
  #   - training MSE and cross validation MSE, along with their standard errors
  # ----------------------------------------------------------------------
  n <- nrow(X)
  folds <- cut(sample(1:n), breaks = K, labels = FALSE)

  train_err <- cv_err <- array(NA, dim = c(K, length(lambda_grid)))
  for(k in 1:K){
    idx_tr <- which(folds != k)
    idx_ts <- which(folds == k)
    
    # Fit the whole lasso path on the training folds only
    fit_glm <- glmnet(X[idx_tr,], y[idx_tr], lambda = lambda_grid)
    
    # drop = FALSE keeps the subsets as matrices even when a fold has a single 
    # row; predictions are requested at every lambda in the grid via s = lambda_grid
    pred_ts <- predict(fit_glm, X[idx_ts, , drop = FALSE], s = lambda_grid)
    pred_tr <- predict(fit_glm, X[idx_tr, , drop = FALSE], s = lambda_grid)
    cv_err[k,] <- colMeans((pred_ts - y[idx_ts])^2)
    train_err[k,] <- colMeans((pred_tr - y[idx_tr])^2)
  }
  cv_MSE <- colMeans(cv_err)
  sd_cv_MSE <- apply(cv_err, 2, sd)/sqrt(K)
  MSE <- colMeans(train_err)
  sd_MSE <- apply(train_err, 2, sd)/sqrt(K)
  
  return(list('cv_MSE' = cv_MSE, 'sd_cv_MSE' = sd_cv_MSE, 
              'MSE' = MSE, 'sd_MSE' = sd_MSE))
}
lambda_grid <- fit_lasso$lambda
K <- 10
fit_err <- cv_penalized_reg(X, y, lambda_grid, K = K)

plotCI(x = log(lambda_grid), y = fit_err$cv_MSE, uiw = fit_err$sd_cv_MSE, 
       col = 'red', scol = 'gray', pch = 16, lwd = 2, xlab = TeX("$\\log (\\lambda)$"), ylim = c(80000, 230000), ylab = 'Error')
points(x = log(lambda_grid), y = fit_err$MSE, col = 'black', pch = 16)

# Vertical line corresponding to the optimal model
abline(v = log(lambda_grid)[which.min(fit_err$cv_MSE)], lwd = 1, col = 1, lty = 2)
abline(h = min(fit_err$cv_MSE) + fit_err$sd_cv_MSE[which.min(fit_err$cv_MSE)], 
       lwd = 1, lty = 2)
# Find the indices corresponding to the models whose CV error is within one 
# standard error of the optimal model
idx <- which( min(fit_err$cv_MSE) + fit_err$sd_cv_MSE[which.min(fit_err$cv_MSE)] >= fit_err$cv_MSE)
# Take the most parsimonious of them (the largest lambda)
idx <- idx[1]
abline(v = log(lambda_grid[idx]), lwd = 1, col = 1, lty = 3)
legend('topleft', legend = c('Training Error','CV Error'), pch = 16, col = c('black','red'))

The training MSE is non-decreasing as a function of \(\lambda\), and its minimum is attained at the smallest value of \(\lambda\), i.e. essentially for the unpenalized model with all the covariates included. This is not surprising: the unpenalized fit explains the training data best. However, it might be overfitting the data!

The cross-validated estimate of the MOOSE along the solution path, as a function of \(\lambda\), is also displayed. As one can see, it behaves very differently from the MSE estimated on the training data: a minimum is attained in the interior of the range of \(\lambda\) values. If we take as optimal the model minimizing the estimated MOOSE, we end up choosing \(\hat{\lambda} = 2.44\). However, another approach is often preferable. Since we generally prefer parsimonious models (that is, larger values of \(\lambda\), and hence fewer covariates), we can use the one-standard-error rule: choose the most parsimonious model whose estimated MOOSE is within one standard error of that of the optimal model. This approach leads to \(\hat{\lambda} = 83.59\).

lambda_grid[which.min(fit_err$cv_MSE)]
[1] 2.436791
lambda_grid[idx]
[1] 83.59338